Note: This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Description of Datasets

The starting datasets are the 94 families of HomFam protein sequences.

See Data Generation Workflow for steps to reproduce the datasets used in this analysis.

For each family and for each “number of sequences”, 10 datasets were generated.
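As a rough illustration of this layout, the sketch below lists the FASTA file names one family would produce, following the `${datasetID}.${size}.${rep}.fa` pattern used in the alignment commands. The function name `list_datasets`, the family ID, the sizes, and the 1 to 10 replicate numbering are all assumptions made for the example; the notebook states only that there are 10 replicates.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the FASTA file names expected for one family.
# Assumption: replicates are numbered 1..10; the source only says there are 10.
list_datasets() {
  local dataset_id="$1"
  shift
  local size rep
  for size in "$@"; do
    for rep in 1 2 3 4 5 6 7 8 9 10; do
      echo "${dataset_id}.${size}.${rep}.fa"
    done
  done
}

# Example call with a made-up family ID and two made-up sizes.
list_datasets "familyA" 1000 2000
```

With two sizes this prints 20 names, from `familyA.1000.1.fa` to `familyA.2000.10.fa`.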

These datasets can be downloaded from here.

ClustalO Standard Alignments

These alignments were generated with the following command:

clustalo --infile=${datasetID}.${size}.${rep}.fa \
         --outfmt=fa \
         --force \
         -o ${datasetID}.${size}.${rep}.aln

ClustalO Regressive Alignments

These alignments were generated with the following command:

t_coffee -dpa -dpa_method clustalo_msa \
         -dpa_tree ${guide_tree} \
         -seq ${seqs} \
         -dpa_nseq ${bucket_size} \
         -outfile ${id}.${size}.${rep}.dpa.${bucket_size}.${align_method}.with.${tree_method}.tree.aln

ClustalO Guide Trees

The guide tree for each dataset was generated using the following command:

clustalo -i ${seqs} --guidetree-out "${id}.${tree_method}.${size}.${rep}.dnd"

For each dataset, the same tree was used as the DPA tree and as the standard ClustalO guide tree.
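One way the tree could be shared between the two aligners is sketched below as a dry run that only prints the commands. It assumes the standard alignment reads the tree through clustalo's `--guidetree-in` option, which the commands in this notebook do not show. The function name `print_shared_tree_cmds`, the output file names, and the bucket size are illustrative, not taken from the workflow.

```shell
#!/usr/bin/env bash
# Dry run: print the three commands for one dataset instead of executing them.
# Assumption: the standard alignment reads the tree via --guidetree-in;
# the notebook does not state how the tree was passed to clustalo.
# Output file names (standard.aln, regressive.aln) are illustrative only.
print_shared_tree_cmds() {
  local seqs="$1"
  local tree="$2"
  local bucket_size="$3"
  # 1. Build the guide tree once.
  echo "clustalo -i ${seqs} --guidetree-out=${tree}"
  # 2. Standard alignment, reusing the tree.
  echo "clustalo --infile=${seqs} --guidetree-in=${tree} --outfmt=fa --force -o standard.aln"
  # 3. Regressive alignment, reusing the same tree.
  echo "t_coffee -dpa -dpa_method clustalo_msa -dpa_tree ${tree} -seq ${seqs} -dpa_nseq ${bucket_size} -outfile regressive.aln"
}

# Made-up inputs for illustration.
print_shared_tree_cmds "familyA.1000.1.fa" "familyA.1000.1.dnd" 1000
```

Printing the commands first makes it easy to confirm that the same `.dnd` file appears in all three before running anything.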

Data Generation Workflow

The data used in this analysis was generated from a Nextflow workflow.

Nextflow is a framework that enables portable and reproducible workflows.

You can find the GitHub repository for the workflow here.

You can generate the data yourself with the following steps:

# Download Nextflow
wget -qO- https://get.nextflow.io | bash

# Run the example dataset
./nextflow run skptic/embeded-analysis-nf

R Data Analysis

# Prerequisites: install and load packages
install.packages("plotly", repos="http://cran.rstudio.com/", dependencies=TRUE)
library(plotly)

# Step 1: import the alignment datasets
clustalo_std_raw <- read.csv("~/Downloads/heatmap_data_clustalo_std.csv", row.names=1)
clustalo_reg_raw <- read.csv("~/Downloads/heatmap_data_clustalo_dpa.csv", row.names=1)

# Steps 2 and 3: display the standard and regressive alignment datasets
clustalo_std_raw
clustalo_reg_raw

# Step 4: normalise each row by its first column; apply() returns the rows as columns, so transpose back
clustalo_std_norm <- apply(clustalo_std_raw, 1, function(x){x/x[1]})
clustalo_std_norm_t <- t(clustalo_std_norm)
clustalo_reg_norm <- apply(clustalo_reg_raw, 1, function(x){x/x[1]})
clustalo_reg_norm_t <- t(clustalo_reg_norm)

# Step 5: plot the ClustalO standard alignment scores
plot_ly(x=colnames(clustalo_std_norm_t), y=rownames(clustalo_std_norm_t), z = clustalo_std_norm_t, type = "heatmap") %>% layout(yaxis = list(autorange = "reversed"))
# Step 6: plot the ClustalO regressive alignment scores
plot_ly(x=colnames(clustalo_reg_norm_t), y=rownames(clustalo_reg_norm_t), z = clustalo_reg_norm_t, type = "heatmap") %>% layout(yaxis = list(autorange = "reversed"))
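The normalisation in the R code above divides every value in a row by that row's first value, so the first column becomes 1 and the rest are relative scores. The shell sketch below reproduces that arithmetic on a tiny CSV as a cross-check. The function name `normalise_rows`, the column names, and the scores are made up, and the layout (header line, row name in the first column) is an assumption based on the `row.names=1` argument.

```shell
#!/usr/bin/env bash
# Divide every value in a CSV row by the row's first data value.
# Assumed layout: header line first, row name in column 1, scores after it.
normalise_rows() {
  awk -F',' 'BEGIN { OFS = "," }
    NR == 1 { print; next }   # pass the header through unchanged
    {
      base = $2               # first data column; saved before it is overwritten
      for (i = 2; i <= NF; i++) $i = $i / base
      print
    }'
}

# Made-up scores for illustration only.
printf 'family,n100,n1000,n2000\nfamilyA,0.8,0.6,0.4\n' | normalise_rows
```

For this input the data row becomes `familyA,1,0.75,0.5`, which matches what `t(apply(..., 1, function(x){x/x[1]}))` would give for the same row.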